Looking under the Hood of Stochastic Machine Learning Algorithms for Parts of Speech Tagging

Authors

Jana Diesner

Kathleen M. Carley

Abstract

A variety of Natural Language Processing and Information Extraction tasks, such as question answering and named entity recognition, can benefit from precise knowledge about a word's syntactic category or Part of Speech (POS) (Stolz, Tannenbaum et al. 1965; Church 1988; Rabiner 1989). POS taggers are widely used to assign a single best POS to every word in text data, with stochastic approaches achieving accuracy rates of up to 96 to 97 percent (Jurafsky and Martin 2000). When building a POS tagger, human beings need to make a set of decisions, some of which significantly impact the accuracy and other performance aspects of the resulting engine. In this paper we provide an overview of these decisions and empirically determine their impact on POS tagging accuracy. We envision the gained insights to be a valuable contribution for people who want to design, implement, modify, fine-tune, integrate, or simply make reasonable use of a POS tagger. Based on the results presented herein, we built and integrated a POS tagger into AutoMap, a tool that facilitates Natural Language Processing and relational text analysis, as a stand-alone feature as well as an auxiliary for other tasks.

The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the National Science Foundation, or the U.S. government. We are grateful to Alex Rudnicky from CMU for providing the training data to us and to Yifen Huang, CMU, for discussing the project with us.

Related resources

Looking under the Hood of Stochastic Machine Learning Algorithms for

This work was supported in part by the Army Research Lab as part of the CTA in Decisi 01-2-0009, the Army Research Institute W91WAW07C0063, and the National Science Foundation IGERT 9972762 in CASOS. Additional support was provided by CASOS and ISR at Carnegie Mellon University. The views and conclusions contained in this document are those of the authors and should not be interpreted as repres...

A Part-of-Speech Tagging System for the Persian Language

Abstract: Part-of-speech (POS) tagging is an essential step for many models and methods in other areas of natural language processing, such as machine translation, spell checking, text-to-speech, and automatic speech recognition. So far, highly accurate POS taggers have been created for many languages. In this paper, we focus on POS tagging in the Persian language. Because of problems in Persian POS t...

A Core-Tools Statistical NLP Course

In the fall term of 2004, I taught a new statistical NLP course focusing on core tools and machine-learning algorithms. The course work was organized around four substantial programming assignments in which the students implemented the important parts of several core tools, including language models (for speech reranking), a maximum entropy classifier, a part-of-speech tagger, a PCFG parser, an...

A Comparison of Machine Learning Methods for Extractive Persian Speech-to-Speech Summarization without Transcripts

In this paper, extractive speech summarization using different machine learning algorithms was investigated. The task of speech summarization deals with extracting important and salient segments from speech in order to access, search, extract, and browse speech files more easily and at lower cost. In this paper, a new method for speech summarization without using automatic speech recognitio...

A Hybrid Optimization Algorithm for Learning Deep Models

Deep learning is a subset of machine learning that is widely used in Artificial Intelligence (AI) fields such as natural language processing and machine vision. The learning algorithms require optimization in multiple aspects. Generally, model-based inference requires solving an optimization problem. In deep learning, the most important problem that can be solved by optimization is neural n...

Publication date: 2008
